1 Introduction

Welcome to an EDA of the McDonald’s India Menu - Nutrition dataset, uploaded by Deep Contractor.

You don’t have to be a health expert to know fast food isn’t good for you. But, how bad is it truly? Can we find patterns in the design of their menu? Let’s explore.

2 Initial Setup

Read through the initial setup in the 4 tabs below.

2.1 Libraries

First, some I import some useful libraries and set some plotting defaults.

# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
library(magrittr)

# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
library(ggbeeswarm)
library(patchwork)

# PCA
if(!require(FactoMineR))
  remotes::install_cran("FactoMineR")
if(!require(factoextra))
  remotes::install_cran("factoextra")
library(FactoMineR)
library(factoextra)

# Tables
library(reactable)

# Settings
theme_set(theme_minimal(
  base_size = 13,
  base_family = "Menlo"))
theme_update(
  plot.title.position = "plot"
)

2.2 Read In

Let’s start be reading in the data. There is only one CSV file, with the menu items and some measurements on each item. {janitor::clean_names} helps us get clean column names quickly.

dat <- read_csv("../input/mcdonalds-india-menu-nutrition-facts/India_Menu.csv") %>% 
  janitor::clean_names()
glimpse(dat, 100)
## Rows: 141
## Columns: 13
## $ menu_category        <chr> "Regular Menu", "Regular Menu", "Regular Menu", "Regular Menu", "Regu…
## $ menu_items           <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSpicy™ Paneer Burger",…
## $ per_serve_size       <chr> "168 g", "146 g", "199 g", "250 g", "177 g", "306 g", "132 g", "87 g"…
## $ energy_k_cal         <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 356.09, 228.21, 400.8…
## $ protein_g            <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.45, 15.66, 15.44, 21…
## $ total_fat_g          <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, 11.44, 15.70, 14.16,…
## $ sat_fat_g            <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.72, 5.47, 5.79, 7.63,…
## $ trans_fat_g          <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0.16, 0.21, 0.18, 0.2…
## $ cholesterols_mg      <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.17, 31.17, 32.83, 66.…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, 24.79, 47.98, 38.85,…
## $ total_sugars_g       <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, 5.53, 5.58, 5.88, 2.…
## $ added_sugars_g       <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4.49, 3.54, 4.49, 1.0…
## $ sodium_mg            <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.22, 579.60, 390.74, 7…

2.3 Quick View

I love to take the first peek into a dataset with the amazing {skimr} package. Few takeaways:

  • per_serve_size doesn’t sound like it should be of type character
  • no missing values except for sodium_mg
skimr::skim(dat)
Data summary
Name dat
Number of rows 141
Number of columns 13
_______________________
Column type frequency:
character 3
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
menu_category 0 1 11 15 0 7 0
menu_items 0 1 8 42 0 141 0
per_serve_size 0 1 3 9 0 107 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
energy_k_cal 0 1.00 244.64 185.55 0 116.36 219.36 339.52 834.36 ▇▇▃▁▁
protein_g 0 1.00 7.49 8.34 0 0.65 4.79 10.88 39.47 ▇▃▁▁▁
total_fat_g 0 1.00 9.99 10.34 0 0.46 7.77 14.16 45.18 ▇▅▂▁▁
sat_fat_g 0 1.00 5.00 4.90 0 0.28 4.27 7.28 20.46 ▇▅▂▁▁
trans_fat_g 0 1.00 0.69 6.33 0 0.06 0.15 0.22 75.26 ▇▁▁▁▁
cholesterols_mg 0 1.00 26.35 50.33 0 1.51 8.39 31.11 302.61 ▇▁▁▁▁
total_carbohydrate_g 0 1.00 31.19 20.60 0 15.74 30.82 46.00 93.84 ▇▇▆▂▁
total_sugars_g 0 1.00 15.46 15.69 0 2.33 9.16 26.95 64.22 ▇▂▂▁▁
added_sugars_g 0 1.00 10.34 14.28 0 0.00 3.64 19.23 64.22 ▇▂▁▁▁
sodium_mg 1 0.99 362.06 473.16 0 43.90 152.02 534.24 2399.49 ▇▂▁▁▁

2.4 Data Quality

My favorite way of exploring missing data is to make it visible, using Nick Tierney’s amazing {naniar} package.

We can see the missing sodium_mg rows here.

dat %>% 
  vis_miss()

3 Interesting Questions

Since this is an open ended exploration, I will posit some questions which will guide the flow of further work.

  1. Which are the unhealthiest menu options, measured in terms of total fat content, total sugar, cholesterol, or sodium?
  2. Which are the most and least energy dense options?
  3. How does the Gourmet Menu differ from the Regular Menu?
  4. What patterns or clusters exist, if any, which can give us a clearer picture of this menu?

4 Feature Development

To aid answering many of these, I first need to create a few new features in the data set.

We go from 13 columns to 15 columns in the data set.

4.1 Serving Size

The serving size by default is a character, with the units (g) embedded in the column. Let’s clean this up, and rename the column to our convention {measurement}_{unit}.

dat <- dat %>% 
  mutate(per_serve_size = as.numeric(stringr::str_extract(per_serve_size, "\\d*"))) %>% 
  rename(per_serve_size_g = per_serve_size)
glimpse(dat, 80)
## Rows: 141
## Columns: 13
## $ menu_category        <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items           <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g     <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal         <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g            <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g          <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g            <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g          <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg      <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g       <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g       <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg            <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…

4.2 Calorie Counts

I want to explore some of the data using calories as a covariate. Here’s the distribution of calories:

ggplot(dat, aes(x = energy_k_cal)) +
  geom_histogram(bins = 50) +
  labs(
    x = "Energy (kCal)",
    y = "Counts"
  )

I’ll use this distribution as a guide to create some bins for the energy column.

dat <- dat %>% 
  mutate(
    cal_cat = case_when(
      energy_k_cal <= 100 ~ "Low",
      energy_k_cal > 100 & energy_k_cal <= 400 ~ "Medium",
      energy_k_cal > 400 & energy_k_cal <= 600 ~ "High",
      energy_k_cal > 600 ~ "Very High"
    ),
    cal_cat = factor(cal_cat, levels = c("Very High", "High", "Medium", "Low"), ordered = TRUE)
  )
glimpse(dat, 80)
## Rows: 141
## Columns: 14
## $ menu_category        <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items           <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g     <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal         <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g            <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g          <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g            <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g          <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg      <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g       <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g       <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg            <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…
## $ cal_cat              <ord> High, Medium, Very High, Very High, High, Very Hi…

4.3 Energy Density

Finally, I’ll add a feature called energy density, which is the number of calories per weight of food. I suspect things like cookies and cake (perhaps even the fizzy drinks?) would rank high on this feature.

dat <- dat %>% 
  mutate(
    energy_density = energy_k_cal / per_serve_size_g
  )
glimpse(dat, 80)
## Rows: 141
## Columns: 15
## $ menu_category        <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items           <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g     <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal         <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g            <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g          <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g            <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g          <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg      <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g       <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g       <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg            <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…
## $ cal_cat              <ord> High, Medium, Very High, Very High, High, Very Hi…
## $ energy_density       <dbl> 2.393155, 2.325479, 3.280201, 2.698720, 2.893616,…

5 Graphical EDA

Now that I have the data sets prepared and ready, it’s time for the fun part - being creative and creating some interesting visuals!

There are four components to the EDA below:

  1. Unhealthiest menu exploration
  2. Energy fense foods exploration
  3. Gourmet vs Regular menu exploration
  4. Principal Component Analysis for multivariate exploration
# Custom plotting functions

make_dot_violin_plot <- function(dat,
                                 title = NULL,
                                 xlab = NULL,
                                 ylab = NULL,
                                 subtitle = NULL,
                                 legend_title = NULL,
                                 n_outliers = 1,
                                 annot_xnudge = 0.1,
                                 annot_xoffset = 0.5,
                                 annot_ynudge = 10,
                                 annot_curv = 0.2,
                                 label_size = 6,
                                 he_width = 0.5,
                                 he_alpha = 0.7,
                                 pt_size = 4,
                                 pt_width = 0.2,
                                 pt_alpha = 0.5){
  
  
  pdat <- dat %>%
    mutate(colorby = forcats::fct_reorder(str_wrap(colorby, 10), -y))
  
  outlier <- pdat %>% 
    arrange(-y) %>% 
    slice(n_outliers)
  
  g <- pdat %>% 
    ggplot(aes(x = forcats::fct_reorder(colorby, -y),
               y = y)) +
    ggdist::stat_halfeye(
      aes(thickness = stat(f * n)),
      width = he_width,
      position = position_nudge(x = .2),
      alpha = he_alpha
    ) +
    geom_quasirandom(
      aes(color = cat),
      size = pt_size,
      width = pt_width,
      alpha = pt_alpha,
      varwidth = TRUE
    ) +
    ggtitle(title) +
    labs(
      x = xlab,
      y = ylab,
      subtitle = subtitle,
      color = legend_title
    ) +
    theme(legend.position = "bottom")
  
  if (n_outliers > 0) {
    g <- g +
      annotate(
        "curve",
        x = as.numeric(outlier$colorby) + annot_xoffset,
        xend = as.numeric(outlier$colorby) + annot_xnudge,
        y = outlier$y + annot_ynudge,
        yend = outlier$y,
        curvature = annot_curv,
        arrow = arrow(length = unit(10, "pt"),
                      type = "closed")
      ) +
      annotate(
        "text",
        x = as.numeric(outlier$colorby) + annot_xoffset,
        y = outlier$y + annot_ynudge,
        label = outlier$menu_items,
        size = label_size,
        hjust = -0.1
      )
  }
  g
}

5.1 Unhealthiest Menu Options

Q: Which are the unhealthiest menu options, measured in terms of total fat content, total sugar, cholesterol, or sodium?

The plots below, each a combination of a jitter plot & a half-eye plot (itself is a combination of a density and an interval plot), are quite information rich.

The individual points allow us to see each data point, look for groupings if any and identify outliers. The density plot gives a visual for the each group’s clustering. The interval plot gives us a quick understanding of the median point (the black dot), as well as the quantiles (the thick & thin lines).

The X-axis is split by the Menu type while the colors are for the binned Calories feature I created.

For each graph, I’ve also found the largest outlier and put a little annotation for what the item is.

5.1.1 Fats

  • Highest Calories also have the highest fat content
  • Gourmet menu, expectedly, is rich in fat
  • Beverages, expectedly, are the least fatty
  • Breakfast and McCafe menu’s show a bi-modal distribution
dat %>% 
  select(y = total_fat_g, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  make_dot_violin_plot(
    title = "How does Total Fat content vary by the menu?",
    ylab = "Total Fat (grams)",
    subtitle = "Each point represents the fat content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 1.5,
    annot_curv = 0.1
  )

5.1.2 Sugars

  • Beverages, expectedly, are the worst. 60g of sugar for a large fanta!
  • McCafe menu’s - which are sold as breakfast items - are surprisingly sugary
dat %>% 
  select(y = total_sugars_g, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  make_dot_violin_plot(
    title = "How does Total Sugar content vary by the menu?",
    ylab = "Total Sugar (grams)",
    subtitle = "Each point represents the sugar content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 1.5,
    annot_curv = 0.1
  )

5.1.3 Sodium

This one really makes the skin crawl. The amount of sodium in the regular and gourmet menus should be a cause for concern for anyone consuming these on any regular basis. Considering the average recommended sodium is 2,300 mg per day, consuming a Ghee Rice with the Spicy Chicken would violate that in a single meal.

dat %>% 
  select(y = sodium_mg, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  tidyr::drop_na() %>% 
  make_dot_violin_plot(
    he_width = .6,
    title = "How does Sodium content vary by the menu?",
    ylab = "Total Sodium (milligrams)",
    subtitle = "Each point represents the sodium content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 1.5,
    annot_curv = 0.1
  )

5.1.4 Cholesterol

Quite a few outliers here, for the gourmet, breakfast and regular menus. The popular Spicy Chicken burger is certainly the worst.

dat %>% 
  select(y = cholesterols_mg, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  make_dot_violin_plot(
    he_width = 1.5,
    title = "How does Cholesterol content vary by the menu?",
    ylab = "Total Cholesterol (milligrams)",
    subtitle = "Each point represents the cholesterol content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 1.5,
    annot_curv = 0.1
  )

5.1.5 Carbohydrates

Carbs are high, unsurprisingly, across the board for the McD menu.

dat %>% 
  select(y = total_carbohydrate_g, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  make_dot_violin_plot(
    he_width = .65,
    title = "How does Carbohydrate content vary by the menu?",
    ylab = "Total Carbohydrate (grams)",
    subtitle = "Each point represents the carbohydrate content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 1.5,
    annot_curv = 0.1
  )

5.2 Energy Dense Foods

Aha! As I suspected, the Muffins are the worst offenders for calories/gram of food. However, look at the Condiments! Those little ketchup and mustard packets are super energy dense!

dat %>% 
  select(y = energy_density, cat = cal_cat, colorby = menu_category, menu_items) %>% 
  make_dot_violin_plot(
    he_width = 1.5,
    title = "How does Carbohydrate content vary by the menu?",
    ylab = "Total Carbohydrate (grams)",
    subtitle = "Each point represents the carbohydrate content for 1 serving of the menu item",
    legend_title = "Calories kCal",
    n_outliers = 1,
    annot_ynudge = 0.15,
    annot_curv = -0.1 
  )

5.3 Gourmet vs Regular

Q: How does the Gourmet Menu differ from the Regular Menu?

Across the board, the Gourmet menu has items with higher values. Except Energy Density, all the other metrics are right-shifted. This is expected, as Gourmet items are regular items with extra oomph!

dat %>%
  filter(menu_category %in% c("Regular Menu", "Gourmet Menu")) %>%
  filter(trans_fat_g < 20) %>% 
  select(menu_category, where(is.numeric)) %>%
  tidyr::pivot_longer(-menu_category) %>%
  ggplot(aes(x = value,
             fill = menu_category)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~ name,
             scales = "free"
  ) +
  ggtitle(
    label = "Comparison of the Gourmet & Regular Menu Options"
  ) +
  labs(
    x = NULL,
    y = "Probability density",
    fill = NULL,
    caption = "Outlier '5 piece Chicken Strips' with `trans_fat` of 75.3g removed"
  ) +
  theme(legend.position = "top",
        plot.subtitle = element_text(face = "italic"))

5.4 Principal Component Analysis

One of my favorite methods of looking at multivariate numerical data is the simple, yet powerful tool - PCA. It’s a great way to extract layers upon layers of information about the dataset from just a few plots. Let’s explore this dataset using PCA.

5.4.1 Calc Components

dat %>% 
  select(where(is.numeric)) %>%
  # Impute the missing sodium_mg values to the median
  mutate(sodium_mg = ifelse(is.na(sodium_mg), median(sodium_mg, na.rm = TRUE), sodium_mg)) %>% 
  scale() %>% 
  as.data.frame() -> dat_scaled

pca <- PCA(dat_scaled, graph = FALSE)
pca
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 141 individuals, described by 12 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

5.4.2 Explore Variables

fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50))

fviz_pca_var(pca, repel = TRUE, alpha.var = "contrib")

fviz_contrib(pca, choice = "var", axes = 1, top = 10)

fviz_contrib(pca, choice = "var", axes = 2, top = 10)

fviz_cos2(pca, choice = "var", axes = 1:2)

5.4.3 Explore Individuals

fviz_pca_ind(pca, col.ind = dat$menu_category, select.ind = list(cos2 = 0.7))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 16 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

fviz_pca_biplot(pca, col.ind = dat$menu_category, select.ind = list(cos2 = 0.8))


That was a fun exploration of these data. What else can you think of to explore?

Cheers!